1.  Introduction


Accurate environment perception is critical for autonomous robots to plan paths on
traversable terrain and avoid object collision during navigation. While many sensors
have been used to help with perception,1-4 speedups in image processing have
allowed vision-based perception to emerge in mobile robots.4-10 Visual classification
is an important task for many applications, but is particularly useful for path
planning because visual data allow robots to perceive a large area of the environment
at once. However, a variety of challenging properties associated with visual
data can make it difficult to train classifiers. These challenges may include changes
in illumination, scale, perspective, color, and background clutter.

Classifiers  can  learn  to  account  for  these  factors,  but  generally  need  large  amounts 
of  training  data  to  sample  and  model  these  variations.  While  collecting  visual  data 
is  a  trivial  task,  the  raw  data  contain  no  label  information  from  which  supervised 
classifiers  can  learn.  Label  collection  is  a  burdensome  task  for  human  annotators  as 
it  requires  human  intervention  to  assign  semantic  labels  to  training  instances,  and 
unfortunately, may not be a one-time event. For example, to ensure the highest quality
visual perception for mobile robots, training data should be collected from the
environment  where  navigation  tasks  will  be  performed.  Thus,  each  domain  change 
requires  new  data  collection  and  labeling. 

Furthermore,  state-of-the-art  deep  learners11,12  now  rely  on  millions  of  images  for 
learning.  Label  collection  for  the  ImageNet  data  set13  was  performed  in  parallel  via 
crowdsourcing,  but  took  a  combined  total  of  approximately  19  person-years.14  This 
process  is  even  more  demanding  for  scene  labeling  classifiers,15-17  because  distinct 
regions in images must be outlined before assigning labels. This is infeasible for
applications with limited labeling time and resources. We define the time for a human
to label a new set of training data as adaptation latency. In our robot navigation
example, this represents the time robots are unable to navigate autonomously because
perception  models  are  being  adapted. 

To keep up with the demand for labeled training data, more semi-supervised and
unsupervised  label  collection  techniques  need  to  be  developed  to  help  reduce  the 
overall  labeling  workload.  Specifically,  we  have  identified  4  objectives  important  in 
the  design  of  such  label  collection  techniques. 
1) Learning a label set: Since the training data are initially unlabeled, we assume
the set of visual concepts in the data is also initially unknown. The process
of learning a label set, commonly referred to as concept discovery,18-20
determines the classes a classifier will learn to recognize. Only concepts with
labeled training examples can be accurately accounted for at test time.

2) Reducing the workload: The motivation of this work, reducing labeling effort,
is an objective itself. We refer to the degree to which labeling effort is
reduced as labeling efficiency. Labeling efficiency is discussed with respect
to 1) the overall effort required to label training data (i.e., user interactions)
and 2) the overall time required to label training data.

3)  Maximizing  label  counts:  Trivially,  reducing  labeling  effort  can  be  achieved 
by  simply  labeling  fewer  training  samples.  However,  a  small  labeled  set  may 
not  be  sufficient  to  train  high-performing  classifiers.  We  refer  to  this  objective 
as  exploitation  of  the  training  data. 

4) Maintaining accuracy: There is always the possibility for error when humans
assign labels to visual data. For example, gravel may be mistaken for
asphalt or a user may mistype car as cat. The fraction of nonerroneous labels
is defined as label accuracy, and the fraction of label errors represents
label noise. High label accuracy is important because noise creates confusion
during classifier training. Beyond human error, most label noise discussed in
this report is introduced by frameworks that employ group-based labeling to
improve labeling efficiency. Visual data believed to represent the same visual
concept are grouped together, and the entire group is assigned a single label
to describe the data (see Fig. 6). When the group of images in fact represents
multiple classes, label noise is introduced.
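The definitions of label accuracy and label noise under group-based labeling can be illustrated with a minimal sketch. The function name `group_label_stats` and the toy group below are hypothetical, and the sketch assumes the annotator always assigns the majority class of the group.

```python
from collections import Counter


def group_label_stats(true_labels):
    """Label a whole group with its majority class.

    Returns the assigned label, the label accuracy (fraction of samples
    whose true class matches the assigned label), and the label noise
    (fraction of samples that receive a wrong label).
    """
    counts = Counter(true_labels)
    majority_label, majority_count = counts.most_common(1)[0]
    accuracy = majority_count / len(true_labels)
    return majority_label, accuracy, 1.0 - accuracy


# Hypothetical group: three dog images and one cat image labeled together.
label, accuracy, noise = group_label_stats(["dog", "dog", "dog", "cat"])
```

In this toy group a single query labels four images, and one of the four labels is wrong.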

Each  objective  plays  a  critical  role  in  determining  the  success  of  classifier  training. 
However,  the  frameworks  that  have  emerged  to  help  alleviate  labeling  effort  tend 
to focus on a subset of the labeling objectives we laid out instead of working toward
all 4. Active learning frameworks21-25 reduce the workload by labeling only
a  subset  of  samples,  so  of  course  they  do  not  maximize  the  label  count.  Moreover, 
active  learning  systems  typically  assume  the  label  set  is  known  in  advance,  and  run 
the  risk  of  increasing  the  total  work  time  by  introducing  latency  while  classifiers 
are  retrained.  Group-based  labeling  techniques  such  as  partitional  clustering,26-28 
incremental  clustering,18,29  active  clustering,30,31  and  topic  modeling32,33  reduce  the
workload  by  labeling  groups  instead  of  instances,  but  can  suffer  from  high  label 
noise  or  reduced  efficiency  if  the  label  noise  is  removed. 

Unfortunately,  adaptation  latency  has  yet  to  be  discussed  in  existing  supervised 
multi-concept visual perception systems used in robotics applications.1,5-7 Annotation
of images is performed as a necessary but time-consuming step to train supervised
classifiers. Unsupervised or self-supervised approaches have been used
to eliminate labeling effort,3,10,34-37 but produce a limited environment vocabulary
(e.g.,  traversable  vs.  non-traversable).  These  techniques  do  not  generalize  well  to 
more  complex  navigation  tasks  that  require  a  richer  set  of  scene  semantics,  such  as 
verbal  navigation  commands  from  humans.38 

Our  work  is  motivated  by  scenarios  that  need  more  than  a  binary  understanding  of 
environments,  and  that  have  limited  time  and  resources  to  collect  this  information. 
We  discuss  our  labeling  framework  designed  to  model  and  balance  each  of  the  4 
objectives  to  yield  fast  label  collection  that  trains  high-performing  visual  classifiers. 
Specifically,  our  approach  is  a  group-based  labeling  technique  where  groups  are 
selected  from  a  hierarchical  clustering  of  the  data.  By  maintaining  a  hierarchical 
clustering,  our  approach  establishes  a  space  of  groupings  that  map  to  coarse  and 
fine-grained  visual  concepts.  This  allows  the  system  to  search  the  hierarchy  and 
discover  groups  that  match  the  concept  granularity  of  the  classifier  and  thereby 
keep  label  noise  to  a  minimum.  These  groups  are  identified  by  searching  for  local 
structural  changes  in  the  hierarchy.  This  selection  heuristic  is  combined  with  criteria 
that  reward  exploration  of  the  search  space  to  discover  new  visual  concepts  and 
labeling  large  clusters  to  maximize  efficiency  and  total  label  count.  Overall,  these 
measures  model  our  defined  objectives,  identify  clusters  from  the  hierarchy  that 
can  be  labeled  with  little  effort,  produce  minimal  label  noise,  and  most  importantly 
collect  data  that  train  high-performing  visual  classifiers. 

Using  several  outdoor  urban  environments,  we  show  that  visual  perception  trained 
with  our  efficient  label  collection  technique  allows  for  reliable  path  planning  and 
successful navigation. We compare the approach to a fully supervised labeling
approach by evaluating pixel labeling rate, pixel-wise classification, and autonomous
navigation  via  road  terrain  with  respect  to  adaptation  latency. 
2.  Related  Work


2.1  Label  Collection 

There  are  3  dominant  approaches  used  to  address  the  labeling  workload  problem: 
crowdsourcing, active learning, and group-based labeling. Crowdsourcing via marketplaces
such as Amazon Mechanical Turk has become a popular way to collect
large  sets  of  annotated  visual  data.13,39  This  technique  allows  the  label  collection 
process  to  be  split  into  smaller  units  of  labor,  and  these  tasks  are  distributed  to  a  set 
of  human  resources  who  work  in  parallel.  Crowdsourcing  has  several  shortcomings. 
First,  this  approach  can  be  quite  costly  for  data  sets  with  millions  of  images  since 
users are paid per labeling task. Second, it has been found that users are highly
inconsistent in their labeling.40 These labeling inconsistencies require reconciliation
and  verification  steps  that  also  require  human  effort. 

Active learning is an instance-based labeling approach that tries to identify an informative
subset of training samples to label with a supervised learner in the loop.
Selection criteria include uncertainty sampling,22-24 Gaussian process models,21 and
information  density.25  Active  learning  reduces  the  number  of  image  instances  a 
user  must  label,  but  often  requires  a  priori  knowledge  for  classifier  seeding  and 
introduces  latency  while  iteratively  retraining  classifiers.  Active  learning  has  been 
combined with crowdsourcing,41,42 but Vijayanarasimhan et al.41 note that retraining
classifiers after each labeling query creates latency that makes the parallelism
of crowdsourcing less evident.

Group-based  labeling  reduces  workload  by  providing  a  single  label  to  a  group  of 
samples.  Clustering26-28  and  topic  modeling32,33  form  groups  through  bottom-up 
discovery,  requiring  no  a  priori  knowledge  of  the  unlabeled  data.  These  techniques 
try to find a 1-to-1 mapping between groups and visual concepts. Unfortunately,
visual  data  properties  make  partitioning  the  data  difficult  and  groups  often  contain 
data  from  multiple  classes.  Assigning  the  dominant  class  label  to  an  entire  set  of 
images  that  represent  multiple  concepts  can  create  label  noise. 

Label  noise  can  be  reduced  at  the  cost  of  additional  labeling  effort  and  labeling 
latency. Active clustering improves group coherency by iteratively collecting constraints
indicating whether 2 images represent the same class.30,31,43 This constraint
information is used to augment feature representations and recluster data. Lee and
Grauman cluster the “easiest” subset of unlabeled data and label a single group on
each iteration to improve overall group coherency.18 A largest subset labeling approach
is used by Galleguillos et al. in their iterative labeling approach with multiple
kernel  metric  learning.29  Largest  subset  labeling  eliminates  label  noise  by  asking  a 
user  to  remove  images  from  a  group  that  do  not  represent  the  dominating  class  label. 
Each  of  these  techniques  introduces  reclustering  latency  after  each  labeling  query. 

Relatedly, there has been work on how to reduce labeling effort for video data. Xie
et al. introduce a label transfer approach where coarse 3-D annotations of street
scenes can be transferred to 2-D images.44 Other semi-supervised label propagation
for video streams has also been achieved with random forests45 and a mixture
of temporal trees.46 These approaches use the information encoded by temporal
consistency to reduce labeling effort, but are not compatible with large sets of nonsequential
training images (e.g., environment A in our experiments).

2.2  Visual  Perception  in  Robotics

Vision provides valuable perception for mobile robots. Terrain and obstacle classification
are particularly important to help determine traversability. For example,
visual terrain classification has been used to identify when legged robots should
change gaits,6,7 and aerial robots can identify possible landing sites or be used to
communicate with ground robots when working in teams.5 Visual perception is also
being used for path planning on ground robots. Häselich et al. fuse 3-D laser scans
and camera images to perceive road, rough, and obstacle terrain classes.1 Häselich
et al. are the first to mention the inability to adapt quickly to new environments due
to the requirement of reannotation.

Consequently, a significant amount of visual perception path planning research focuses
on semi-supervised, self-supervised, and online learning. Teleoperation has
been  used  to  define  optimal  routes  to  infer  path  and  nonpath  labels  for  visual 
classifiers.47  Ross  et  al.  identify  obstacles  with  an  unsupervised,  online  technique 
that  compares  visual  appearance  and  structure  to  learned  environment  models.10 
Roncancio  et  al.  adapt  a  pretrained  supervised  visual  classifier  online  to  identify 
traversable  and  non-traversable  paths.9 

Other techniques pair vision with complementary sensors. Visual features have been
used to enhance radar ground prediction.3 The correspondence between visual features
and a robot’s navigation experience (e.g., slippage) was used to identify traversable
terrain.34 Lookingbill et al. used a reverse optical flow technique to update
visual classifiers with the appearance of obstacles beyond the range of stereo
vision.36 Other self-supervised learning examples include combining vision and
LiDAR.35,37

These examples adapt terrain classifiers without the time-consuming labeling process.
However, the lack of human supervision has limited most of this work to
binary  classification  (e.g.,  traversability).  Unfortunately,  these  approaches  do  not 
extend  to  more  complex  multiclass  tasks  such  as  verbal  navigation  commands  from 
human  to  robot.38 

3.  Reducing  Adaptation  Latency 

Our  labeling  system  is  designed  to  be  quick  and  efficient  so  new  sets  of  labeled 
training  data  can  be  easily  collected  by  a  single  human  annotator.  Our  approach, 
called  hierarchical  cluster  guided  labeling  (HCGL),  iteratively  selects  clusters  to 
label  from  a  hierarchical  clustering  of  the  data  samples.  Before  motivating  our  use 
of  hierarchical  clustering  and  discussing  details  of  our  group  selection  criteria,  we 
overview  the  traditional  supervised  labeling  approach  used  to  label  environment 
images. 

3.1  Supervised  Labeling 

Supervised label collection produces high quality labeled data, but is time consuming
for 2 reasons: 1) training sets are typically large and 2) images capture multiple
terrains and objects in the scene that need to be localized before label assignment.
Image annotation tools such as LabelMe48 have been used to facilitate supervised
labeling. LabelMe allows annotators to precisely outline, via mouse clicks, and
assign labels to each distinct region. Figure 1 is an example of a training image
(left), required outlining (middle) and labeled output (right; see class/color legend
in Fig. 18) using LabelMe. Labeling 250 images requires over 20 h of effort (discussed
in Section 4), which causes high latency during domain changes and inhibits fast
adaptation.
Fig.  1  Example  image,  outline  annotation  and  label  results  of  a  supervised  labeling  process


The  goal  of  this  work  is  to  train  supervised  multi-concept  visual  classifiers  using 
large  amounts  of  labeled  environment  data  with  limited  human  interaction.  We  use 
the HCGL framework,49 originally designed to cluster and label groups of single-concept
images. We discuss HCGL and several modifications made to the framework
to better suit real-world environment data. After discussing the efficient label
collection  technique,  we  compare  HCGL  to  supervised  labeling  with  LabelMe  to 
demonstrate  the  speedup  achieved. 

3.2  Hierarchical  Motivation 

As  previously  mentioned,  many  group-based  labeling  techniques  have  emerged. 
One  major  disadvantage  of  group-based  labeling  is  the  addition  of  label  noise  when 
images in the same group represent multiple visual concepts. We hypothesize that
label noise collected by partitional group-based labeling approaches arises in
large part because the unsupervised grouping algorithm learns feature patterns that
map to a different concept granularity than the concepts of interest for the classification
task.

Visual concepts are hierarchical. Figure 2 includes example images from 2 single-concept
benchmark classification data sets, 13-Scenes50 and Caltech-256.51 The labels
in red indicate the concept granularity these benchmark data sets would use
to evaluate supervised classifiers. However, note that all labels associated with an
image are valid visual concept descriptions, and represent a progression of descriptions
from coarse-grained to fine-grained.
(a)  Scene  images  and  labels  (b)  Object  images  and  labels


Fig.  2  Examples  of  coarse-  and  fine-grained  visual  concepts  for  benchmark  data  images, 
where  labels  in  red  indicate  the  label  used  for  benchmark  classification  evaluation 


The label set of interest to a classifier (which is task dependent) is denoted as $\mathcal{Y}$,
where $|\mathcal{Y}| = K$. In the object labeling example, dog $\in \mathcal{Y}$. The goal of partitional
grouping techniques is to find a partition of $K$ groups such that each group maps to a
label in $\mathcal{Y}$. Grouping is influenced by feature representation and intraclass and interclass
similarity, which are hard to manage explicitly with an unsupervised grouping
algorithm. In many cases, a grouping algorithm may identify a pattern in the data
that represents a coarser- or finer-grained concept than those defined in $\mathcal{Y}$. For example,
in the 13-Scenes data set, groups of images representing coast and highway
share a coarse-grained open quality since the horizon is visible in both classes.
Alternatively, images might be grouped at the fine-grained level of dog, cow, and
sheep when the task is only interested in animal.

Instead of forcing groupings to occur at a particular level of granularity, we use hierarchical
clustering to maintain a spectrum of pattern similarities encoded in image
groupings. Figure 3 illustrates this approach with 5 classes from the 13-Scenes data
set. We denote the hierarchy as $\mathcal{H}$. Nodes colored black correspond to groups that
contain  images  from  multiple  scene  classes.  The  remaining  colors  indicate  groups 
of  images  from  a  single  scene  class.  There  is  an  obvious  division  of  the  hierarchy 
into  4  groups:  tall  building  (green),  living  room  (blue),  suburb  (yellow),  and  the 
coarse-grained  concept  of  open  (dashed  outline)  previously  mentioned.  The  many 
smaller,  interweaved  partitions  of  the  coast  (red)  and  highway  (orange)  classes,  as 
subtrees  of  the  open  partition,  are  evidence  of  high  interclass  similarity. 
Fig. 3 Hierarchical clustering of 5 classes from the 13-Scenes data set. Node colors
indicate what class the images in the cluster represent: yellow-suburb, blue-living room,
green-tall building, red-coast, and orange-highway. The section of the hierarchy outlined by
the dotted line highlights the data grouped as the coarse-grained visual concept open.


Maintaining the hierarchical structure provides many unique benefits to our labeling
framework. First, the number of clusters does not have to be known in advance.
Instead, the hierarchy allows each class to break down coherently at its own pace
(i.e., at different levels in $\mathcal{H}$). Second, the hierarchical relationships provide information
about how feature patterns change as data are further refined into smaller
groups.  Later  in  this  section,  we  describe  how  we  use  these  relationships  to  define 
interestingness  and  guide  our  search  for  clusters  that  should  be  labeled.  Third,  the 
hierarchical  clustering  avoids  latency  during  labeling.  If  the  system  asks  the  user 
to label a cluster that is too high in the tree (i.e., too coarse-grained for the label
set $\mathcal{Y}$), the user marks it as too coarse and the system immediately has access to
the  subtree  below  it,  representing  possible  finer-grained  concepts.  No  reclustering 
is  necessary,  and  no  classifier  has  to  be  retrained  while  the  user  waits. 

3.3  Iterative  Selection  and  Labeling 

Hierarchical  clustering  is  not  the  solution  to  the  label  collection  problem  but  rather 
an  encoding  of  information  that  the  HCGL  framework  uses  to  solve  the  labeling 
problem.  At  a  very  high  level,  HCGL  is  simply  an  iterative  group-based  labeling 
technique. That is, HCGL selects a group from the hierarchy and displays the images
in the group to a user who assigns it a label. This process repeats until there
are  no  more  groups  to  be  labeled  or,  more  likely,  the  user  runs  out  of  time. 

Figure 4 illustrates a single labeling interaction in the iterative HCGL process and
its use of majority group-based labeling. Since most images in the group are of a
dog, the group is labeled as such. The images in this group that do not represent dog
become examples of label noise. Annotators are encouraged to only supply labels
to groups of images that are dominated by a single label in $\mathcal{Y}$, to keep the level of
noise down. When HCGL mistakenly selects a group that is not dominated by a
single label, the annotator labels it as too coarse, telling HCGL that it selected a
node too high in the hierarchy for this branch of the tree.

Approved for public release; distribution is unlimited.

Fig. 4 Illustration of a labeling iteration performed by HCGL (diagram steps: group selection, add label)


We  note  that  the  iterative  labeling  process  may  terminate  before  all  the  training 
data  have  been  labeled.  This  framework  is  designed  around  the  assumption  that  in 
many  real-world  applications,  labeling  a  large  set  of  training  data  in  its  entirety  is 
not  feasible.  Thus,  the  group  selection  technique  employed  by  HCGL  is  a  primary 
contribution  of  this  work.  Groups  need  to  be  selected  in  such  a  way  that  a  diverse 
and  accurate  set  of  training  data  can  be  collected  even  when  users  only  have  a  small 
amount  of  time  to  devote  to  labeling. 

Group-based  labeling  is  beneficial  for  these  types  of  real-world  applications  since 
multiple  images  are  labeled  with  a  single  labeling  query,  leading  to  efficiency  gains. 
However,  unlike  partitional  grouping  approaches,  which  create  a  set  of  disjoint 
groups  and  ask  the  user  to  assign  a  label  to  each  group,  HCGL  must  select  groups 
from $\mathcal{H}$, which contains nondisjoint groups and some redundant information. For
example, if cluster $c$ in Fig. 5 contains dog images, it must be true that its children,
$c_r$ and $c_l$, represent the same concept since they each contain a subset of images
represented in $c$. Thus, the selection order of groups in $\mathcal{H}$ is meaningful because
descendants of a labeled group (according to the structure in $\mathcal{H}$) can inherit labels
without being viewed and labeled by the annotator. The remainder of this section
focuses on the novel technique HCGL uses to select groups from $\mathcal{H}$ during the
iterative labeling process.
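Label inheritance by descendants can be sketched as a simple tree traversal. This is an illustration only: the dictionary-based tree, the function name `propagate_label`, and the choice to keep any label a descendant already holds are assumptions, not details taken from the HCGL implementation.

```python
def propagate_label(node, label, children, labels):
    """Give `label` to `node` and to every descendant that has no label yet.

    children: dict mapping a node id to a list of its child node ids
    labels:   dict mapping a node id to its label (updated in place)
    """
    stack = [node]
    while stack:
        current = stack.pop()
        # Assumption: a label assigned in an earlier iteration is kept.
        labels.setdefault(current, label)
        stack.extend(children.get(current, []))
    return labels


# Hypothetical neighborhood: cluster "c" with children "c_l" and "c_r".
children = {"c": ["c_l", "c_r"]}
labels = propagate_label("c", "dog", children, {})
```

A single query on the parent cluster labels both children without showing them to the annotator.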


Fig. 5 Illustration that depicts the relationships for group $c$ in a local neighborhood of $\mathcal{H}$,
including its parent $p$ and left and right children, $c_l$ and $c_r$


3.4  Group  Selection 

HCGL group selection is designed to balance the labeling objectives of discovery,
efficiency, exploitation, and accuracy. To do this, we define the following heuristic
criteria for groups in $\mathcal{H}$:

1) Interestingness: the degree of structural change seen after a split in $\mathcal{H}$

2)  Exploitation:  the  number  of  samples  that  would  receive  labels 

3)  Exploration:  the  likelihood  that  a  group  represents  a  different  concept  from 
those  previously  labeled 

These  3  criteria  are  discussed  individually  in  detail,  followed  by  a  discussion  of 
how  the  criteria  are  combined  to  create  our  novel  group  selection  criteria. 

3.4.1  Interestingness 

The hierarchy encodes groups of data that span a spectrum of concept granularities,
but the classification task, and thus the labeling task, focuses on a specific but
initially unknown set of labels $\mathcal{Y}$. This means that for HCGL to be successful, the
algorithm must find locations (image groups) in $\mathcal{H}$ that are most likely to represent
a single label in $\mathcal{Y}$. As stated previously, we assume no a priori knowledge
of $\mathcal{Y}$. Instead, HCGL compares the image features in hierarchically related groups
to measure local structural change. More specifically, interestingness is defined as
the degree of change at a split in $\mathcal{H}$. The idea is that feature similarities encoded
in $\mathcal{H}$ map to coarse- and fine-grained visual concepts. When underlying patterns of
similarity change, it is likely that a visual concept transition has also occurred.

HCGL compares the structural change between a cluster, $c$, and its parent, $p$ (this
relationship can be seen in Fig. 5). The internal structure of $c$ is derived from its
data matrix, $X_c$, where each column is an image represented by a $d$-dimensional
feature vector:

$$
X_c = \begin{bmatrix}
x_{1,1} & x_{2,1} & \cdots & x_{s,1} \\
x_{1,2} & x_{2,2} & \cdots & x_{s,2} \\
\vdots & \vdots & \ddots & \vdots \\
x_{1,d} & x_{2,d} & \cdots & x_{s,d}
\end{bmatrix} \tag{1}
$$

The data are mean centered, $\hat{X}_c = X_c - \bar{X}_c$, and the covariance matrix of $\hat{X}_c$ is
decomposed and represented by its eigenvectors, $V_c$, using singular value decomposition:


(2) 

(3) 


This representation of internal structure models the direction of variance in the data.
Given that the diagonal entries of $\Lambda_c$ are sorted in descending order, the first eigenvector,
$v_{c1}$, in $V_c$ provides the axis of maximum variance for $c$.
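One conventional way to write the decomposition described above is sketched here. The covariance symbol $\Sigma_c$ and the $1/(s-1)$ normalization are assumptions that do not come from the text; the normalization only rescales the eigenvalues and leaves the eigenvectors unchanged.

```latex
\hat{X}_c = X_c - \bar{X}_c, \qquad
\Sigma_c = \frac{1}{s-1}\,\hat{X}_c \hat{X}_c^{\top} = V_c \Lambda_c V_c^{\top}
```

Under this form, the columns of $V_c$ are the left singular vectors of the centered data, which is why a singular value decomposition of $\hat{X}_c$ yields them directly.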

The structural change between $c$ and $p$ is calculated as the angle between the first
eigenvectors, $v_{c1}$ and $v_{p1}$, of $c$ and $p$, respectively. Larger angles indicate greater
differences in directions of variance and are therefore more interesting. Formally,
interestingness derived from structural change for group $c$ is defined as the cosine
distance,

$$\Delta(c) = 1.0 - \langle v_{c1}, v_{p1} \rangle, \tag{4}$$

which yields values on the interval of [0.0, 1.0] with large $\Delta$ values representing
large angles. The idea behind this type of selection is to order groups by the strength
of their potential concept transition.
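The interestingness computation can be sketched with NumPy as follows. The function names are hypothetical, and taking the absolute value of the inner product is an assumption made here because the sign of an eigenvector is arbitrary; it keeps the score on [0.0, 1.0].

```python
import numpy as np


def first_principal_direction(X):
    """Return the unit vector of maximum variance for a d x s data matrix.

    Each column of X is one image's d-dimensional feature vector.
    """
    X_centered = X - X.mean(axis=1, keepdims=True)
    # The left singular vectors of the centered data are the eigenvectors
    # of its covariance matrix; NumPy sorts singular values in descending
    # order, so the first column is the axis of maximum variance.
    U, _, _ = np.linalg.svd(X_centered, full_matrices=False)
    return U[:, 0]


def interestingness(X_child, X_parent):
    """Cosine distance between the first eigenvectors of child and parent."""
    v_c = first_principal_direction(X_child)
    v_p = first_principal_direction(X_parent)
    # Assumption: abs() removes the arbitrary sign of each eigenvector.
    return 1.0 - abs(float(np.dot(v_c, v_p)))


# Toy data: the parent varies along the first axis, the child along the second.
X_parent = np.array([[-2.0, -1.0, 0.0, 1.0, 2.0],
                     [0.0, 0.0, 0.0, 0.0, 0.0]])
X_child = np.array([[0.0, 0.0, 0.0],
                    [-1.0, 0.0, 1.0]])
score = interestingness(X_child, X_parent)
```

With orthogonal directions of variance the score is at its maximum; a child whose main direction matches its parent's scores close to zero.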
3.4.2  Exploitation

Exploitation  selection  is  based  on  the  number  of  unlabeled  samples  in  a  cluster. 
This  criterion  is  designed  to  label  larger  clusters  first  to  emphasize  the  efficiency 
objective and collect a large set of labeled data quickly. In other words, cluster $c$ in
Fig. 5 has a higher exploitation value than both $c_l$ and $c_r$ since each contains a subset
of $c$’s images, resulting in fewer labeled images per labeling query.

However, the exploitation value is not necessarily equal to the number of images in
a group, because some of the descendant groups may already have been labeled in
a previous iteration of HCGL. This set of previously labeled groups is denoted as
$\mathcal{L}$. Formally, the exploitation score for group $c_i$ is defined as

$$\xi(c_i) = |c_i| - |\{x_i \mid x_i \in c_j,\ c_j \in \mathcal{L},\ c_j \subset c_i\}|. \tag{5}$$

Since exploitation is based on $\mathcal{L}$, the $\xi$ values change after each labeling iteration.
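The exploitation score can be sketched with Python sets, where each cluster is the set of sample indices it contains. The function name and the toy clusters are hypothetical; using a set union means a sample covered by several nested labeled groups is counted only once.

```python
def exploitation(cluster, labeled_clusters):
    """Number of samples in `cluster` that would newly receive a label.

    cluster:          set of sample indices in the candidate group
    labeled_clusters: iterable of sets, the groups labeled so far
    """
    already_labeled = set()
    for labeled in labeled_clusters:
        # Only labeled groups strictly inside the candidate are subtracted.
        if labeled < cluster:
            already_labeled |= labeled
    return len(cluster) - len(already_labeled)


# Hypothetical example: 6 samples, 2 of them labeled in an earlier iteration.
candidate = {0, 1, 2, 3, 4, 5}
previously_labeled = [{0, 1}, {7, 8}]
score = exploitation(candidate, previously_labeled)
```

The score is recomputed after every labeling iteration because the set of labeled groups grows.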

3.4.3  Exploration 

Exploration spreads group labels throughout $\mathcal{H}$ to better explore the feature space
as a way of discovering groups that represent concepts that have yet to be labeled.
Exploration values iteratively change throughout the labeling process because they
are computed with respect to $\mathcal{L}$. Specifically, the exploration score for cluster $c_i$ is
the shortest path in the hierarchy between it and the nearest labeled cluster,

$$\phi(c_i) = \min_{c_j \in \mathcal{L}} \text{path-length}(c_i, c_j), \tag{6}$$

where the path length between $c_i$ and $c_j$ is the combined number of edges traversed
by each node to reach their first common ancestor. For example, $c_l$ and $c_r$ in Fig. 5
have a path length of 2 where their first common ancestor is $c$. Exploration ordering
labels clusters with the longest path length first (i.e., groups that are least similar to
groups already discovered and labeled).
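The path-length computation can be sketched with parent pointers. The dictionary representation and function names are hypothetical, and the sketch assumes a single rooted tree and at least one labeled cluster.

```python
def steps_to_ancestors(node, parent):
    """Map `node` and each of its ancestors to the number of edges climbed."""
    steps = {}
    count = 0
    while node is not None:
        steps[node] = count
        node = parent.get(node)  # the root has no entry, which ends the loop
        count += 1
    return steps


def path_length(a, b, parent):
    """Edges traversed by a and b to reach their first common ancestor."""
    up_from_a = steps_to_ancestors(a, parent)
    node, steps_b = b, 0
    while node not in up_from_a:
        node = parent[node]
        steps_b += 1
    return up_from_a[node] + steps_b


def exploration(cluster, labeled_clusters, parent):
    """Path length from `cluster` to the nearest labeled cluster."""
    return min(path_length(cluster, other, parent) for other in labeled_clusters)


# Hypothetical neighborhood: parent p, cluster c, and children c_l and c_r.
parent = {"c": "p", "c_l": "c", "c_r": "c"}
sibling_distance = path_length("c_l", "c_r", parent)
score = exploration("c_l", ["c_r", "p"], parent)
```

The sibling distance of 2 matches the worked example in the text, where the first common ancestor of the two children is their parent cluster.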

3.4.4  Multi-Objective  Combination 

The 3 heuristic criteria are designed to emphasize different objectives when collecting
labeled training data for supervised classifiers. Since each objective is important
to the problem, we define a multi-objective, rank-based combination criterion.
In this combination, the set of unlabeled groups, $\mathcal{U}$, is ranked according to the
3 criteria individually, and the rankings are linearly combined to produce a multi-objective
ranking score for each cluster $c_i$:

$$
\begin{aligned}
\psi(c_i) = {} & \beta_1 \, \mathrm{rank}(\Delta(c_i), \{\Delta(c_j) \mid c_j \in \mathcal{U}\}) \\
& + \beta_2 \, \mathrm{rank}(\xi(c_i), \{\xi(c_j) \mid c_j \in \mathcal{U}\}) \\
& + \beta_3 \, \mathrm{rank}(\phi(c_i), \{\phi(c_j) \mid c_j \in \mathcal{U}\}).
\end{aligned} \tag{7}
$$

Each  ^  is  a  weight  for  its  ordering  objective  such  that  Pi  +  @2  +  @3  =  1-0.  For  all 
experiments  in  this  report,  the  objectives  are  weighted  evenly,  Pi  =  P2  =  Pz  =  \- 
The  rank  function  is  passed  a  group’s  score  and  the  set  of  scores  for  all  other  groups 
in  U ,  and  returns  the  group’s  rank  with  respect  to  the  set.  The  group  with  the  highest 
combined  ranking  is  selected  as  the  next  labeling  query. 
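
As a rough illustration of the rank combination in Eq. 7, the sketch below ranks three hypothetical groups under each criterion and combines the ranks with equal weights. The tie-handling rule in `rank` is an assumption (the text does not specify one), and all names and score values are made up for the example.

```python
def rank(score, all_scores):
    """Rank of `score` within `all_scores`; larger scores get larger ranks.

    Assumption: ties share a rank (1 + number of strictly smaller scores).
    """
    return 1 + sum(1 for s in all_scores if s < score)


def combined_rank(group, interest, exploit, explore, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Eq. 7: weighted sum of the three per-criterion ranks for one group."""
    b1, b2, b3 = weights
    return (b1 * rank(interest[group], interest.values())
            + b2 * rank(exploit[group], exploit.values())
            + b3 * rank(explore[group], explore.values()))


# Hypothetical scores for three unlabeled groups.
interest = {"a": 0.9, "b": 0.4, "c": 0.1}   # interestingness
exploit = {"a": 10, "b": 50, "c": 5}        # exploitation
explore = {"a": 4, "b": 2, "c": 6}          # exploration

psi = {g: combined_rank(g, interest, exploit, explore) for g in interest}
next_query = max(psi, key=psi.get)          # group with the highest combined rank
```

Because ranks rather than raw scores are combined, criteria measured on very different scales (image counts versus path lengths) contribute equally.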

Any one of the heuristics could be used independently to select the next group
to label; however, exploitation and exploration selections are not very interesting
on their own. Exploitation essentially performs a breadth-first search of $\mathcal{H}$, and
exploration will mostly choose leaf nodes in $\mathcal{H}$ as these produce the longest path
lengths. To allow the technique to generalize to any of these selection heuristics,
selection is not performed on the entire unlabeled set, but on a set of the most
interesting groups from $\mathcal{H}$, denoted as $\mathcal{S}$. This set is constructed by comparing
a group's interestingness score to the distribution statistics of all interestingness
scores from unlabeled groups. Formally, let $\bar{\Delta}$ and $\sigma_{\Delta}$ denote the mean and
standard deviation of $\Delta(c_i)$ over all $c_i \in \mathcal{U}$.

We refer to the groups with structural change values at least one standard deviation
beyond the mean as outliers, and thereby, the most interesting set of groups:

$$\mathcal{S} = \{c_i \mid \Delta(c_i) > \bar{\Delta} + \sigma_{\Delta}\}. \qquad (10)$$
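
The outlier test in Eq. 10 can be sketched in a few lines. The sketch uses the population standard deviation (`statistics.pstdev`); whether the original work uses the population or sample form is not stated here, so that choice is an assumption, as are the function name and the scores.

```python
from statistics import mean, pstdev


def interesting_set(interest):
    """Eq. 10: groups whose score exceeds the mean by more than one std. dev."""
    scores = list(interest.values())
    threshold = mean(scores) + pstdev(scores)   # assumption: population std. dev.
    return {g for g, score in interest.items() if score > threshold}


# Hypothetical interestingness scores for five unlabeled groups.
interest = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0, "e": 6.0}
selected = interesting_set(interest)
```

Here the mean is 2.0 and the standard deviation is 2.0, so only the group scoring above 4.0 is kept as a candidate.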


Algorithm 1 summarizes the generalized iterative HCGL framework, where the
selection-criterion requirement defined on line 1 could represent any of the following:
HCGL-Interestingness (Eq. 4), HCGL-Exploitation (Eq. 5), HCGL-Exploration
(Eq. 6), or HCGL-Combined Ranks (Eq. 7).
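
Algorithm 1 Hierarchical cluster guided labeling

Require: $\mathcal{H}$, selection-criterion = $\{\Delta, \xi, \phi, \psi\}$

1: $\mathcal{U} = \{c_i \mid c_i \in \mathcal{H}\}$
2: while $\mathcal{U} \neq \emptyset$ && user == True do
3:     threshold = $\bar{\Delta} + \sigma_{\Delta}$
4:     $\mathcal{S} = \{c_i \mid \Delta(c_i) > \text{threshold},\ c_i \in \mathcal{U}\}$
5:     Update selection-criterion scores $\forall c_i \in \mathcal{U}$
6:     $\mathcal{S}$ = sort($\mathcal{S}$, selection-criterion)
7:     label query → top-ranked group in $\mathcal{S}$
8:     Update $\mathcal{L}$ and $\mathcal{U}$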
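
The loop in Algorithm 1 might look roughly like the following in Python. This is a sketch under several assumptions, not the authors' implementation: the `ask_user` callback, the stop-on-`None` convention, the early exit when no group exceeds the threshold, and the choice to remove only the queried group from the unlabeled set are all hypothetical.

```python
from statistics import mean, pstdev


def hcgl_loop(interest, criterion, ask_user):
    """Query group labels until the user stops or no candidates remain.

    interest:  dict mapping group id -> interestingness score
    criterion: callable(group, labeled, unlabeled) -> selection score
    ask_user:  callable(group) -> label, or None to stop (hypothetical)
    """
    unlabeled = set(interest)                  # line 1: start with every group
    labeled = {}                               # group id -> label from the user
    while unlabeled:                           # line 2
        scores = [interest[g] for g in unlabeled]
        threshold = mean(scores) + pstdev(scores)                      # line 3
        candidates = [g for g in unlabeled if interest[g] > threshold]  # line 4
        if not candidates:                     # assumption: stop if nothing stands out
            break
        ranked = sorted(candidates,            # lines 5-6: rescore, sort descending
                        key=lambda g: criterion(g, labeled, unlabeled),
                        reverse=True)
        query = ranked[0]
        label = ask_user(query)                # line 7: show the top group to the user
        if label is None:                      # the user chose to stop labeling
            break
        labeled[query] = label                 # line 8: update both sets
        unlabeled.discard(query)
    return labeled


# Hypothetical run: select by interestingness and always answer "sky".
interest = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 6, "f": 9}
result = hcgl_loop(interest, lambda g, lab, unl: interest[g], lambda g: "sky")
```

Passing a different `criterion` callable switches between the selection heuristics without changing the loop itself.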


3.5 HCGL for Multi-Concept Images

In the context of semantic scene labeling, every pixel in a multi-concept image
needs to be assigned a label. Since HCGL clusters data under the assumption that
each training sample represents a single concept, images must be segmented into
multiple regions. We oversegment images into approximately 150 regions using
simple linear iterative clustering (SLIC).52 Oversegmentation is performed to ensure
that true region boundaries are observed in the training samples. Each segment
becomes a training sample that is described by a feature vector composed of LAB
color histograms, local binary patterns,53 a 200-dimensional codebook of
scale-invariant feature transform (SIFT) descriptors,54 and normalized region coordinates.
The resulting feature vectors are then hierarchically clustered using agglomerative
clustering. For all experiments, we use Ward's linkage and Euclidean distance to
create the hierarchy. An illustration of the HCGL framework running on image regions
is shown in Fig. 6. Node colors map to a class in the label set $\mathcal{Y}$ and black
wedges represent the percentage of noise in each cluster (images not representing
the dominating class).

Fig. 6 Visualization of the HCGL process on multi-concept environment data


4.  Speed  and  Classification  Experiments 

To demonstrate the speed and real-world feasibility of HCGL, we present results
of experiments labeling real-world data in outdoor environments. Specifically, we
present results showing pixel labeling rate and classification accuracy as a function
of interaction time, where interaction time is the total time a human spends
waiting for and answering labeling queries from a system. Thus, interaction time
includes any latency introduced by techniques that recluster or retrain classifiers.
This evaluation is motivated by autonomous mobile robotics applications that frequently
change domains or environments, and need to quickly train classifiers to
learn new terrains and objects with relatively low operator overhead.

We use 3 real-world environments to demonstrate the speed and performance of
HCGL when collecting labels for multi-concept visual perception. The environments
are outdoor urban training facilities that include multiple types of terrain,
buildings, cars, and other objects. Training data for environment A were collected
with a high dynamic range camera. Images were taken at 5 different time blocks
over 2 days from 53 locations in the environment.55 Training data for environments
B and C were captured via teleoperation using the robot described in Section 5. The
training set from environment B is the combination of data collected on 3 consecutive
days and is therefore much larger than the other sets. Performance on this data
set shows how HCGL scales with increasing training set sizes. An overview of the
data sets is provided in Table 1 and example images are provided in Fig. 7.

Table 1 Details of the real-world environment data sets

Environment   No. training images   Label set ($\mathcal{Y}$)
A             274                   asphalt, building, concrete, grass, gravel, object, sky, tree
B             1,982                 building, car, grass, object, road, sidewalk, sky, tree
C             268                   building, car, curb, grass, object, road, sidewalk, sky, tree

Fig. 7 Example images from the 3 environment training sets


For these experiments, we compare HCGL to the supervised labeling baseline, LabelMe,
where training images are labeled in random order. We do not make direct
comparisons to other existing efficient labeling frameworks because most frameworks
do not provide interfaces for real-time labeling. We hypothesize that the
latency introduced by data reclustering may inhibit the real-world practicality of
such interfaces. Pixel-wise labeling and classification accuracy are evaluated as a
function of labeling interaction time (i.e., adaptation latency) to show the speed at
which techniques can collect multi-concept scene labels for visual classifiers.

4.1  Labeling  Speed  and  Label  Accuracy 

The first experiment looks at how fast HCGL and LabelMe assign labels to pixels
in the training images. Figure 8 shows the percentage of labeled pixels as a function
of time. There is a large performance gap across all data sets. Given a
fixed training time, HCGL collects around 6 to 7 times as much label information
as LabelMe. Labeling interaction time for environment B is on the order of
hours because each of the 3 days of training data was labeled separately and then
combined.

Fig. 8 Comparison of labeling rate using HCGL and LabelMe on the 3 data sets, plotted
against labeling interaction time (minutes): (a) environment A, (b) environment B,
(c) environment C. Dashed blue lines depict the percentage of labeled pixels that
received correct labels from HCGL

Assigning labels quickly is important, but recall that to achieve this speed, HCGL
incurs some label noise as a result of majority labeling. The dashed blue lines in the
plots show the percentage of pixels that received accurate labels from HCGL (determined
using the labels collected by LabelMe). This line represents approximately
5%-10% pixel label noise: a small fraction for a large gain in efficiency.

4.2  Pixel-Wise  Classification 

Next, to test environment perception using labels from HCGL and LabelMe, we
train a Hierarchical Inference Machine (HIM),16 an approach for scene parsing and
region classification. HIMs incorporate both feature descriptors and contextual cues
computed at multiple scales within the scene. Images are decomposed into a hierarchy
of nested superpixel regions,56 where regions at the bottom provide localized
discriminative information and those at the top provide global context. The predictor
is a decision forest regressor with 10 trees. Features extracted from superpixels
include SIFT descriptors,54 LAB colorspace statistics, texture information, and
statistics on the size and shape of superpixel regions.

4.2.1 Environment A

Pixel-wise classification accuracy is compared on a testing set from environment
A, which consists of 265 images. This is the only environment data set with a large,
hand-labeled testing set.55 Classification evaluation is performed incrementally after
every 15 min in which a user assigns labels to the training set. Figure 9 shows
the overall pixel accuracy for environment A. Even though HCGL introduces small
amounts of label noise, the larger volume of labeled training data allows HCGL to
train higher performing HIMs than LabelMe through 210 min of labeling interaction.
HCGL labeling is terminated at this point to depict scenarios with limited time
to devote to label collection.

Fig. 9 Comparison of the overall pixel classification accuracy for environment A as a
function of labeling interaction (minutes)


Overall pixel classification accuracy can be skewed by classes with more pixels
(pixel distributions can be seen in Fig. 10). Thus, we also evaluate per-class classification
accuracy and find that HCGL performs similarly to or better than LabelMe for
all classes but one, as shown in Fig. 11. The object class is the least represented in
the data and is composed of many diverse things (e.g., light poles, traffic cones, and
cargo boxes). The low intraclass similarity makes it difficult for samples to group
together in HCGL. Instead, samples of object are dispersed almost randomly across
the hierarchical clustering and are often mislabeled as part of clusters dominated
by other classes. As a result, HCGL achieves lower accuracy than LabelMe on this
class. However, this is a poorly defined "other" class, and is difficult for LabelMe
as well. With a fully labeled training set (1,602 min), LabelMe achieves only approximately
18% classification accuracy for the object class.

Fig. 10 Breakdown of class distributions across all pixels in the training set of
environment A (training data pixel frequency)

Fig. 11 Comparison of class-specific classification accuracy for environment A

In applications with limited time for label collection, it can be tempting to run
HCGL with the exploitation heuristic only (Eq. 5) to collect as many labels as possible
in the allotted time. We compare HCGL-exploitation and HCGL-combined
ranks to once again show the importance of balancing all ordering criteria during
the labeling process, even in applications with limited labeling time.

As designed, HCGL-exploitation focuses on labeling large groups of data quickly.
This results in a small number of classes receiving a large number of labels early in
the labeling process because of skewed class distributions. Most of the labeled training
samples represent either sky, grass, or building. This ultimately produces worse
HIM classifiers than HCGL-combined ranks and LabelMe. A subset of classification
results is shown in Fig. 12, and Fig. 13 shows some examples of classified
images from the test set.

Fig. 12 Comparison of classification accuracy for environment A using the
HCGL-exploitation selection heuristic, plotted against labeling interaction (minutes) for
LabelMe, HCGL-combined ranks, and HCGL-exploitation: (a) overall pixels, (b) class with
small distribution of pixels, (c) class with large distribution of pixels (sky)

Fig. 13 Examples of classified test images from environment A that illustrate the
weaknesses of HCGL-exploitation and LabelMe compared to HCGL-combined ranks

4.3 Environment B

Labeled test sets from environments B and C are not available, but we provide a
qualitative pixel-wise classification comparison of HCGL and LabelMe. Figure 14
shows 6 example test images from environment B. We use LabelMe to create
ground truth for these images, seen in the bottom row. Classifiers are trained using
labeled data at the third markers from Fig. 8b. The selected examples show
2 instances where the classifiers perform similarly, an example where HCGL performs
slightly worse than LabelMe (column 3), and 3 examples (the last 3 columns)
of HCGL's superior performance that illustrate the common mistakes made
by the classifier trained using LabelMe. Specifically, the LabelMe classifier often
misidentifies terrain farther from the camera. This allows robots to make immediate
decisions, but negatively impacts long-term path planning. Qualitatively, it can also
be seen that HCGL commonly misclassifies trees and certain objects as sky, mistakes
that are less costly for our navigation task. These mistakes occur because the tree and
object classes are less represented than sky in the training set, so fewer examples are
collected by HCGL. However, the overall HCGL performance on these classes is
still qualitatively high. Overall, HCGL collects significantly more label information
even with 25% less human interaction time and trains higher-performing classifiers.


Fig. 14 Qualitative comparison between HCGL and LabelMe with a test set from
environment B

5. Real-Time Navigation Experiments

Pixel-wise accuracy quantitatively compares techniques on static data, but task-based
evaluation judges perception relative to the end goal of successful navigation
in outdoor environments. We compare several visual classifiers trained using labels
collected by HCGL and LabelMe based on their ability to provide perception
information to a real-time mapping and navigation framework.57

5.1  Task  Description 

Our live navigation task requires a robot to use visual perception to plan paths between
waypoints using specified terrain. These terrains are defined based on the
composition of the road at testing locations. We use road traversal because roads
are designed to provide navigation guidance to vehicles. For example, roads direct
vehicles around buildings and hazards like bodies of water. Our experiments emulate
these scenarios by defining waypoints (seen in Fig. 15) such that the most direct
path to goals is not along a road.


Fig. 15 Navigation waypoint maps for (a) environment A and (b) environment B


Classifiers  are  compared  based  on  successes  and  failures  during  multiple  trials  of 
the  navigation  task,  where  outcomes  are  defined  as  follows: 

•  Success:  the  robot  autonomously  traverses  between  waypoints  using  only 
road  terrain  without  hitting  objects. 

•  Success  with  minor  errors:  the  robot  traverses  between  waypoints  but  either 
1)  traverses  on  non-road  terrain  for  a  short  duration  or  2)  requires  operator 
intervention  at  least  once  but  no  more  than  twice  for  small  adjustments  in 
location  or  direction  due  to  potential  object  collision  or  planner  failure. 

•  Failure: the robot cannot plan and execute a road traversal even with minimal
operator intervention, visual perception has significant false-positive errors
indicating no road path, or constant planner updates result in no progress
toward the goal.


5.2  Hardware 


The robot used in this work, the Clearpath Husky seen in Fig. 16, is a 39 × 26 × 14
inch wheeled platform that is limited to a maximum velocity of 1 m/s. The Husky
employs a MicroStrain 3DM-GX3-25 IMU, a Garmin 18 GPS, and 2 Quad-Core
Intel i7 Mini-ITX processing payloads, each with a 256-GB SSD running Ubuntu
14.04, ROS Indigo, and experimental software. The Husky has a Velodyne HDL-32E
LiDAR, which generates 360° point clouds at a range of 70 m and an accuracy
of up to ±2 cm. Finally, the Husky collects image data using a Prosilica GT2750C,
a 6-megapixel charge-coupled device (CCD) color camera.


Fig. 16 Hardware configuration of the Clearpath Husky robot. Labeled components:
Garmin GPS receiver; 2 quad-core i7 computers; PicoStation wireless access point;
Velodyne HDL 32E LiDAR; X-box joystick receiver; MicroStrain GX3-35 IMU;
Prosilica GX2750C camera with 8mm lens and polarizing filter; batteries behind custom
front plate (computer and sensor power)


5.3  Mapping  and  Navigation 

Our robot test platform employs a mapping and navigation system to enable accurate
motion between desired waypoints. The mapping system, dubbed OmniMapper,
consumes measurements from LiDAR for relative motion estimation and loop
closure through iterative closest point (ICP),58 GPS measurements,59 and camera
images. A keyframe is created with each measurement as the robot moves through
its environment; the robot's pose at this keyframe is optimized through GTSAM60
to minimize residual error from all measurements.

Approved  for  public  release;  distribution  is  unlimited. 




A 2-D local occupancy grid is created from each laser-scan keyframe through raytracing,
where sufficient height above the ground is registered as an obstacle. When
a new keyframe is added or when a significant update is made to the map causing
keyframe poses to change, the 2-D occupancy grids are composited together into a
negative log-odds grid and thresholded into an obstacle map as in Fig. 17b.
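
One common way to composite several occupancy grids is to sum per-cell log-odds and threshold the result. The sketch below shows that generic technique only; the sign convention, probabilities, and threshold used by the system described here are not given in the text, so every value and name in the example is an assumption.

```python
import math


def log_odds(p):
    """Convert an occupancy probability to log-odds."""
    return math.log(p / (1.0 - p))


def composite(grids, threshold=0.0):
    """Sum per-keyframe log-odds cell by cell, then threshold into obstacles.

    grids: list of equally sized 2-D lists of occupancy probabilities,
           assumed to be already aligned to a common map frame.
    """
    rows, cols = len(grids[0]), len(grids[0][0])
    total = [[0.0] * cols for _ in range(rows)]
    for grid in grids:
        for r in range(rows):
            for c in range(cols):
                total[r][c] += log_odds(grid[r][c])
    return [[total[r][c] > threshold for c in range(cols)] for r in range(rows)]


# Two hypothetical 1x3 keyframe grids; 0.5 means "no information".
scan_1 = [[0.7, 0.3, 0.5]]
scan_2 = [[0.7, 0.5, 0.3]]
obstacle_map = composite([scan_1, scan_2])
```

Working in log-odds makes the update additive, so recompositing after a keyframe pose change only requires re-summing the affected cells.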

A  keyframe  is  also  created  for  each  classified  image,  and  the  pose  of  this  record  is 
updated  with  the  mapping  process  such  as  with  loop  closures  or  GPS  measurements. 
Whenever  a  new  obstacle  map  is  created,  additional  cells  are  marked  as  “obstacle” 
if  those  cells,  when  projected  into  classified  images,  overlap  with  pixels  classified 
as  one  of  the  defined  non-road  terrains  or  an  object  class.  In  Fig.  17a,  only  asphalt 
and  concrete  make  up  the  road  for  this  testing  location. 


Fig. 17 Example obstacle maps for location 2 in environment A: (a) environment,
(b) LiDAR map, (c) vision map. Darker regions indicate obstacles and non-road terrain.


The corners of each map grid cell (10 × 10 cm) are projected into all classified
images that observe that cell within a range of 7 m. The classified images are rectified
so the projected corners define a quad in the classified image. Each pixel in the
projected quad has a label from the classifier and votes for that class to be applied
to the ground cell. The ground cell is assigned the label with the highest number of
votes. If this label does not represent road for navigation, the occupancy grid cell is
given an obstacle value to prevent traversal through that cell. As seen in Fig. 17c,
visual perception helps produce cost maps with specific terrain information (e.g.,
gravel regions are darker and avoided during path planning; discussed further in
Section 5).
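
The per-cell voting step can be sketched as a simple majority vote. The sketch assumes the pixel labels inside a cell's projected quads have already been gathered into a list; the function name, the handling of unobserved cells, and the example votes are hypothetical, and tie-breaking is not specified in the text.

```python
from collections import Counter


def label_cell(pixel_labels, road_classes):
    """Return (winning label, is_obstacle) for one ground cell.

    pixel_labels: class labels of all pixels inside the cell's projected quads
    road_classes: labels treated as traversable road at this location
    """
    if not pixel_labels:               # assumption: unobserved cells stay free
        return None, False
    winner, _ = Counter(pixel_labels).most_common(1)[0]
    return winner, winner not in road_classes


# Hypothetical votes for two cells where only asphalt and concrete count as road.
road = {"asphalt", "concrete"}
road_cell = label_cell(["asphalt"] * 5 + ["gravel"] * 3, road)
gravel_cell = label_cell(["gravel"] * 4 + ["asphalt"] * 2, road)
```

Voting over all pixels in the quad lets a few misclassified pixels be outvoted, so isolated classification errors do not automatically block a cell.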

A kinematically feasible path is computed from the robot's current location to the
goal location using the Search-Based Planning Library (SBPL)61 with a set of motion
primitives generated to match the Husky's kinematics. A smoothed local plan
is chosen that follows the global plan closely while avoiding local obstacles not yet
present in the global map. Planner failures occur if the occupancy grid prohibits an
obstacle-free path to the goal. This occurs in our experiments due to false-positive
non-road classifications on road terrain. See Gregory et al.62 for more implementation
details of the mapping and navigation systems used in this work.

5.4 Navigation Results: Environment A

Environment A is the primary location used for comparative evaluation since LabelMe
was used to label its entire training set.55 Four classifiers are trained and
compared. We compare the labeling techniques given the same amount of labeling
interaction time. HCGL-150 and LabelMe-150 represent classifiers trained after
150 min of labeling, which reflects scenarios where limited labeling time is available.
This is just under one-tenth of the estimated total time (1,602 min) required
to label the entire training set with LabelMe. To demonstrate results given no time
restrictions, a classifier is trained using the entire training set, denoted as
LabelMe-1602.

The  final  classifier  is  meant  to  show  the  benefits  of  using  training  data  representing 
the  most  recent  state  of  a  robot’s  environment  and  how  HCGL  easily  facilitates 
the  labeling  of  data  upon  arrival  to  a  new  or  changed  environment.  We  supplement 
the  existing  training  set  (collected  several  years  ago)  with  231  additional  images 
collected  during  our  experiments  (disjoint  from  testing  locations).  Labeling  was 
performed  for  30  min  with  HCGL,  and  approximately  27%  of  the  pixels  in  the 
new  images  were  assigned  labels.  Without  ground  truth  for  this  set,  the  amount 
of  collected  label  noise  is  unknown.  This  set  of  labeled  data  is  combined  with  the 
labeled  data  from  HCGL-150  to  train  the  final  classifier,  denoted  as  HCGL-150+30. 

Navigation experiments are performed at 2 locations in the environment. Location
1 is illustrated with red waypoints in Fig. 15a, and roads are composed of gravel,
concrete, and asphalt. Thus, path planning must avoid grass terrain (the shortest
path between waypoints) and several objects near the edge of the grass and road.
Each trial represents a traversal from one waypoint to the other, and trials are performed
in both directions. Trials were run across multiple days and different times of day to
capture performance under varying environment conditions. Table 2 compares the
performance of each classifier at this first location.


Table 2 Summary of navigation results for location 1 (red waypoints) in environment A

                 % Successes
Label Model      No Errors   Minor Errors   % Failures
HCGL-150         0.500       0.000          0.500
LabelMe-150      0.333       0.167          0.500
LabelMe-1602     0.250       0.250          0.500
HCGL-150+30      0.875       0.125          0.000


HCGL-150 and LabelMe-150 perform similarly and inconsistently, each with a 50% failure
rate. LabelMe-1602 exhibits the same failure rate, but also displays more minor
errors during its successful trials. LabelMe-1602 uses the most labeled data to
learn class boundaries with respect to the training set, but performs worse because
the environment has changed since the training data were collected, so the learned class
boundaries no longer hold. The classifiers trained after 150 min likely
learned less definitive class boundaries, making the environment changes less detrimental.
Some observed changes from the training data include grass length, cloud
coverage, and illumination. HCGL-150+30, on the other hand, performs the navigation
task very reliably, because it represents a classifier that has adapted to the
changed environment with new and additional training data. Minor errors involved
the robot trying to plan a shortest path through the grass, entering the grass for a
brief moment before backing out, and then successfully planning a road traversal route.
These results demonstrate the positive impact of rapid label collection, even if a
small fraction is noisy, when new training data are needed to adapt and improve
visual perception.

Qualitative evaluation of visual perception shows the labeling models produce classifiers
that make different mistakes. Figure 18 includes examples explicitly chosen
to depict some of the worst classified images by one or more models. HCGL-150
had many false-positive concrete classifications, which can be seen best in columns
1, 3, and 4. Columns 3 and 5 highlight that LabelMe-150 produced more false positives
of object and building classes on what was actually road terrain. LabelMe-1602
has cleaner results than the previous models, but also often misclassified
gravel as object (seen in column 3), and tended to misclassify trees as buildings
(seen in columns 1 and 2). Although still not perfect classification, HCGL-150+30
has the most accurate results compared to the ground truth, which yielded its superior
navigation success and highlights the importance of being able to quickly
collect large amounts of new labeled training data given environment changes.

Fig. 18 Visual perception examples for each labeling model. Images in the first 3 columns
are from the first location (red waypoints), and the last 3 columns are images from the second
location (blue waypoints)


The  second  location  is  depicted  in  Fig.  15a  with  blue  waypoints.  At  this  location, 
roads  are  composed  of  concrete  and  asphalt,  whereas  gravel  terrain  (shortest  path 
between  waypoints)  is  not  road.  Along  the  shortest  road  path  are  2  objects  (traffic 
cones)  that  the  robot  must  also  avoid.  Terrain  classification  for  classes  with  high 
interclass  similarity  is  important  for  successful  traversal  during  this  test. 

Comparisons are made between LabelMe-1602 and HCGL-150+30, the most successful
models at the first location in terms of successes and qualitative evaluation.
Results are summarized in Table 3 and indicate that this navigation task is more
challenging. However, HCGL-150+30 is still able to successfully navigate the majority
of the time with only minor errors. Most failures and errors at this location
were caused by classification confusion of asphalt and gravel. This can be seen in
the last 3 columns of Fig. 18.

Table 3 Summary of navigation results for location 2 (blue waypoints) in environment A

                 % Successes
Label Model      No Errors   Minor Errors   % Failures
LabelMe-1602     0.000       0.000          1.000
HCGL-150+30      0.375       0.250          0.375


5.5  Navigation  Results:  Environment  B 

We use environment B, seen in Fig. 15b, to further test HCGL label collection
in new domains. In this environment, roads are composed of a single terrain type
labeled as road, and all other terrains and objects should be avoided during path
planning. Training data for this environment were not available prior to the experiment,
so data were collected upon arrival. We chose to focus our navigation trial
experiments on labels collected using HCGL to show the consistency of the system
across multiple environments.

An in-depth discussion and analysis of experiments in this environment is omitted,
but examples of the robot perception in this environment are shown in Fig. 14. Over
15 navigation trials were performed between both waypoint sets without any failure
cases. Only minor path planning errors in a few trials caused the robot to traverse on
the edge of the grass where it meets the road. These successes are used to confirm
that small amounts of label noise collected by HCGL, in exchange for fast label
collection, do not negatively impact path planning.

6.  Conclusion  and  Future  Work 

Real-time  visual  perception  for  mobile  robots  is  only  as  useful  as  its  ability  to 
quickly  adapt  to  changing  environments.  We  discussed  an  efficient  label  collection 
technique,  HCGL,  for  multi-concept  environment  data.  It  was  shown  that  while 
HCGL  trades  some  label  accuracy  for  reduced  adaptation  latency,  this  label  noise 
does  not  significantly  impact  visual  perception  for  navigation.  Using  this  technique, 
high-quality  visual  perception  can  be  obtained  in  new  environments  with  only  a  few 
hours  of  labeling  effort  from  a  human  annotator. 

The multi-concept semantics provided by HCGL allow this work to generalize to
more complex variations of path planning tasks. This includes assigning variable
costs to terrains based on robot capabilities and path planning with verbal navigation
cues given during human-robot interaction. Future work also includes augmenting
HCGL to be even more effective through online label collection and adaptation.